Prototyping a data extraction pipeline for bluesky.social and exploration of bluesky user activity for digital disease detection of influenza-like illness
Digital Epidemiology 2025, Hasselt University
2025-04-10
bluesky social network
bluesky API
bluesky messages
bluesky: general aspects
twitter in user experience
Decentralized User Identifier (DID)
Personal Data servers (PDS)
DIDs and affiliated contents are portable between PDSs
Users can choose, prioritize and develop feed generators and content labelers
twitter by Elon Musk
X in Brazil, presidential election in the US
bluesky
Google Scholar search: “bluesky” AND “social” since 2022
43 articles
main topics:
X to bluesky 2024
no results for
searchPosts API method
selected parameters:
q: search query
since, until: defining search period
limit: max. 100 posts
deterministic search
allows exhaustive sampling
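A minimal sketch of exhaustive sampling with the parameters listed above. It assumes that each response carries a `posts` list and, while more results exist, a `cursor` to pass to the next request. `fetch_page` is a hypothetical callable that performs the actual HTTP or SDK call; it is injected so the paging logic can be checked offline.

```python
from typing import Callable, Dict, Iterator, Optional


def build_params(q: str, since: str, until: str,
                 limit: int = 100, cursor: Optional[str] = None) -> Dict[str, object]:
    """Assemble the query parameters for one searchPosts request."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    params: Dict[str, object] = {"q": q, "since": since, "until": until, "limit": limit}
    if cursor is not None:
        params["cursor"] = cursor
    return params


def iter_posts(fetch_page: Callable[[Dict[str, object]], dict],
               q: str, since: str, until: str, limit: int = 100) -> Iterator[dict]:
    """Yield all posts of the search period by following the pagination cursor."""
    cursor = None
    while True:
        page = fetch_page(build_params(q, since, until, limit, cursor))
        posts = page.get("posts", [])
        yield from posts
        cursor = page.get("cursor")
        # Stop when the server returns no further page (assumed convention).
        if not posts or cursor is None:
            break
```

Because the search is deterministic, rerunning the same `q`, `since` and `until` should return the same posts, so a period can be split into smaller windows and collected piece by piece.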
defined in the SDK documentation
fields (selection):
uri: unique post identifier
author: contains the did, which allows retrieving the user profile
record: contains the text and time information of the message
langs: language(s) detected by the bluesky server
embedded: any embedded media (images, other posts, etc.)
in contrast to former twitter post metadata, no geoinformation
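The selected fields can be flattened into one row per post, for example before loading into a table. This is a sketch under assumptions about the JSON layout: `text`, `createdAt` and `langs` are taken from inside `record`, and embedded media is looked up under an `embed` key. The output column names are illustrative.

```python
from typing import Any, Dict


def flatten_post(post: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce one post object to the fields used for surveillance."""
    record = post.get("record", {})
    return {
        "uri": post["uri"],                         # unique post identifier
        "author_did": post["author"]["did"],        # key for profile lookup
        "text": record.get("text", ""),
        "created_at": record.get("createdAt"),
        # Assumed location of the language tags; fall back to the top level.
        "langs": record.get("langs") or post.get("langs") or [],
        "has_embed": "embed" in post,
    }


example = {
    "uri": "at://did:plc:example/app.bsky.feed.post/1",
    "author": {"did": "did:plc:example"},
    "record": {"text": "état grippal", "createdAt": "2025-01-06T08:00:00Z", "langs": ["fr"]},
}
row = flatten_post(example)
```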
Feedgens
Labelers
no geo information
getProfiles API endpoint
bluesky post data for digital disease surveillance
Implementation of a continuous surveillance pipeline
focused on French bluesky posts (data volume constraint)
extraction using a list of keywords
extraction of
WHO Flumart
Data analysis starting from 2023-08-01
| | ILI posts | Control posts | ILI incidence |
|---|---|---|---|
| ILI posts | 1.000 | 0.863 | 0.773 |
| Control posts | 0.863 | 1.000 | 0.541 |
| ILI incidence | 0.773 | 0.541 | 1.000 |
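A sketch of how such a correlation matrix can be computed from weekly series. The slides do not state the correlation type; Pearson correlation is assumed here, and the toy numbers are illustrative only.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


def correlation_matrix(series: Dict[str, Sequence[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise correlations between all named weekly series."""
    names = list(series)
    return {a: {b: pearson(series[a], series[b]) for b in names} for a in names}


# Toy weekly values, not data from the study.
toy = {"ILI posts": [1, 2, 3, 4], "Control posts": [2, 4, 6, 8], "ILI incidence": [4, 3, 2, 1]}
matrix = correlation_matrix(toy)
```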
\(Y_w:\) ILI Incidence in week \(w\)
\(X_w:\) Input features obtained in week \(w\)
\[Y_{w+1} = f(X_w, X_{w-1}, X_{w-2})\]
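The formula can be turned into a supervised dataset by stacking the features of the current and the two previous weeks and pairing them with the incidence of the following week. A minimal sketch; the function name and the list-based layout are illustrative, not taken from the project code.

```python
from typing import List, Sequence, Tuple


def make_lagged_dataset(X: Sequence[Sequence[float]], y: Sequence[float],
                        n_lags: int = 3) -> Tuple[List[List[float]], List[float]]:
    """Build rows [X_w, X_{w-1}, ..., X_{w-n_lags+1}] with target Y_{w+1}."""
    if len(X) != len(y):
        raise ValueError("X and y must cover the same weeks")
    rows: List[List[float]] = []
    targets: List[float] = []
    # The first usable week needs n_lags - 1 predecessors; the last needs a successor.
    for w in range(n_lags - 1, len(X) - 1):
        features: List[float] = []
        for lag in range(n_lags):
            features.extend(X[w - lag])
        rows.append(features)
        targets.append(y[w + 1])
    return rows, targets


# Five toy weeks with one feature each.
rows, targets = make_lagged_dataset([[0], [1], [2], [3], [4]], [10, 11, 12, 13, 14])
```

With five weeks and three lags, only weeks 2 and 3 have both enough history and a following week, so the sketch yields two training rows.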
| Dataset | MAE* | RMSE |
|---|---|---|
| Training | \(23.96\) | \(33.93\) |
| Validation + | \(56.54\) | \(56.54\) |
* Mean absolute error, incidence per 100,000
+ mean over all validation runs
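For reference, both error metrics can be computed as follows. The numbers in the example are illustrative and unrelated to the results in the table.

```python
from math import sqrt
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error; penalizes large errors more strongly than MAE."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy incidences per 100,000.
observed = [100, 150, 200]
predicted = [110, 130, 200]
example_mae = mae(observed, predicted)
example_rmse = rmse(observed, predicted)
```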
model agnostic feature importance procedure
random shuffling of single input features
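The shuffling procedure can be sketched as permutation importance: shuffle one feature column, measure how much the error increases, and average over several repeats. This is a generic sketch, not the project's implementation; `predict` stands for any fitted model and the choice of MAE as score is an assumption.

```python
import random
from typing import Callable, List, Sequence


def mean_abs_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict: Callable[[List[List[float]]], List[float]],
                           X: Sequence[Sequence[float]], y: Sequence[float],
                           n_repeats: int = 10, seed: int = 0) -> List[float]:
    """Mean increase in MAE when a single feature column is shuffled."""
    rng = random.Random(seed)
    data = [list(row) for row in X]
    baseline = mean_abs_error(y, predict(data))
    importances = []
    for j in range(len(data[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in data]
            rng.shuffle(column)  # breaks the link between feature j and the target
            shuffled = [row[:j] + [column[i]] + row[j + 1:] for i, row in enumerate(data)]
            increases.append(mean_abs_error(y, predict(shuffled)) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy model that only uses the first feature.
X_toy = [[i, (i * 7) % 5] for i in range(20)]
y_toy = [2 * i for i in range(20)]
scores = permutation_importance(lambda rows: [2 * r[0] for r in rows], X_toy, y_toy)
```

A feature the model ignores gets an importance of zero, while shuffling a feature the model relies on raises the error.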
json structured output option for convenient data processing
Analyze the following tweet-like message to determine if it describes the user's own influenza-like illness (ILI). ILI is defined by:
- Fever ≥38°C (100°F) **AND**
- At least one respiratory symptom (cough or sore throat) **PLUS**
- Additional systemic symptoms (headache, muscle aches, chills, fatigue, nasal congestion)
...
{ ... bluesky message dynamically inserted here ... }
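A hedged sketch of how the prompt could be filled per message and how a JSON-structured reply could be parsed. The reply schema is not shown in these slides, so the field names `is_ili` and `symptoms` are hypothetical, and no call to the LLM API is made here.

```python
import json
from typing import Any, Dict


def build_prompt(instructions: str, message: str) -> str:
    """Append the bluesky message to the fixed classification instructions."""
    return f"{instructions.strip()}\n\n{message.strip()}"


def parse_annotation(raw: str) -> Dict[str, Any]:
    """Parse a JSON reply into {'is_ili': bool or None, 'symptoms': list of str}."""
    empty = {"is_ili": None, "symptoms": []}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return empty  # keep the pipeline running on malformed replies
    if not isinstance(data, dict):
        return empty
    symptoms = data.get("symptoms") or []
    if isinstance(symptoms, str):
        # Accept a comma-separated string as well as a list.
        symptoms = [s.strip() for s in symptoms.split(",") if s.strip()]
    is_ili = data.get("is_ili")
    return {"is_ili": is_ili if isinstance(is_ili, bool) else None,
            "symptoms": list(symptoms)}
```

Returning a fixed structure for every reply, including unparsable ones, keeps the downstream table schema stable.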
Extraction using Google Gemini API
Bon bah ça, c'est fait. En arrêt maladie jusque Lundi, pour état grippal.
Pas le Covid, ni la grippe, mais des migraines, courbatures, impression de pas avoir dormi depuis 15 jours.
Je vais peut-être pouvoir finir mon doc sur Boris Vian sur Arte, et finir mes séries.
Et vous faire chier ici 😈
Well, that's done. On sick leave until Monday, for a flu-like condition.
Not Covid, nor the flu, but migraines, body aches, and the feeling of not having slept for 15 days.
I may be able to finish my doc on Boris Vian on Arte, and finish my series.
And annoy you all here 😈
migraines,courbatures
Grippe aviaire : les coupes budgétaires de Trump amplifient la menace pandémique www.lepoint.fr/tiny/1-2586310 #Santé via @lepoint.fr
Avian flu: Trump's budget cuts amplify the pandemic threat www.lepoint.fr/tiny/1-2586310 #health via @lepoint.fr
nan
| | LLM ILI posts | ILI incidence | Control posts |
|---|---|---|---|
| LLM ILI posts | 1.000 | 0.793 | 0.812 |
| ILI incidence | 0.793 | 1.000 | 0.557 |
| Control posts | 0.812 | 0.557 | 1.000 |
| Dataset | MAE* (LLM filtered) |
|---|---|
| Training | \(26.64\) |
| Validation | \(58.87\) |
* Mean absolute error, incidence per 100,000
bluesky = promising data source
investigate impact of LLM filtering on model performance
modeling of weekly ILI incidence based on message content
continuous data acquisition pipeline (WIP)
User localization based on profile
monitoring of bursts in user activity crucial
repeating the analysis for another country (e.g. Germany)
graph LR
subgraph kestra
dlt(dlt) --- posts
llm --- bqstaging
llm -- annotation --> bqstaging
posts --> bqstaging[<b>GBQ</b> \n stage area \n 1 table per kw]
dlt -- housekeeping --> count
dlt -- case data --> who_tables
dlt -- case data --> cdc_tables
subgraph BigQuery data lake
bqstaging
who_tables
cdc_tables
count[post counts table]
end
bqstaging --- dbt
who_tables --- dbt
cdc_tables --- dbt
count --- dbt
dbt --> bq[Google \n BigQuery]
subgraph BigQuery data warehouse
bq
end
end
bsky[bsky API] --> dlt
WHO --> dlt
CDC --> dlt
bq --> looker[Looker studio \n dashboard]
bq -- python --> stat1[Statistical analysis]
bq -- python --> stat2[Machine learning, modeling]
Open source implementation
dlt, dbt, kestra
available at: https://github.com/kantundpeterpan/bluesky_ddd_influenza